Abstract

In this paper we study the effect of energy parameters on minimum free energy (mfe)
RNA secondary structures. Employing a simplified combinatorial energy model that
depends only on the diagram representation and is not sequence specific, we
prove the following dichotomy result. Mfe structures derived via the Turner energy
parameters contain only finitely many complex irreducible substructures, whereas just
minor parameter changes produce a class of mfe-structures that contain a large
number of small irreducibles. We localize the exact point where the distribution
of irreducibles experiences this phase transition from a discrete limit to a central
limit distribution and subsequently put our result into the context of quantifying
the effect of sparsification of the folding of these respective mfe-structures. We
show that the sparsification of realistic mfe-structures leads to a constant time and
space reduction and that the sparsification of the folding of structures with modified
parameters leads to a linear time and space reduction. We furthermore identify the
limit distribution at the phase transition as a Rayleigh distribution.

Keywords: RNA secondary structure, loops-based, energy model, dominant singu- 
larity, limit distribution 



1. Introduction 



An RNA molecule is described by its primary structure, a linear string composed of 
the nucleotides A, G, U and C, referred to as the backbone. Each nucleotide can 
form a base pair by interacting with at most one other nucleotide by establishing 
hydrogen bonds. Here we restrict ourselves to Watson-Crick base pairs GC and AU 
as well as the wobble base pairs GU. In the following, base triples as well as other 
types of more complex interactions are neglected. RNA structures can be presented 
as diagrams by drawing the backbone horizontally and all base pairs as arcs in the 
upper halfplane; see Figure 1. This set of arcs is tantamount to our notion of
coarse-grained RNA structure. In particular, we shall ignore any spatial embedding or
geometry of the molecule beyond its collection of base pairs. Accordingly, particular 
classes of base pairs translate into specific structure categories, the most prominent 
of which are secondary structures [9j [T2J, [161 E] • When represented as diagrams, 
secondary structures have only non-crossing base pairs (arcs).

FIGURE 1. (A) An RNA secondary structure and (B) its diagram representation.

In the following an RNA secondary structure is tantamount to a diagram without any
crossing arcs.
The combinatorics and prediction of RNA secondary from primary structure was 
pioneered three decades ago by Michael Waterman [TTJ [16], [TJ HE] • 

RNA structures are a result of a folding of the primary sequence. The folded config- 
urations are energetically somewhat optimal. Here energy means free energy, which 
is dominated by the loops forming between adjacent base pairs and not by the
hydrogen bonds of the individual base pairs [10]. In addition, steric constraints imply
certain minimum arc-length conditions for minimum free energy configurations [14].

For a given RNA sequence, polynomial-time dynamic programming (DP) algorithms
can be devised that find such minimal energy configurations. The most commonly
used tools for predicting simple RNA secondary structures, mfold [20] and the Vienna
RNA Package [6], run in $O(N^2)$ space and $O(N^3)$ time.

In the context of polynomial-time DP algorithms a particular method, sparsification,
has been devised. Sparsification is tailored to speed up DP-algorithms
predicting minimum free energy (mfe) secondary structures pjH [T] by pruning certain
computation paths encountered in the DP-recursions, see Fig. 2. In the context




FIGURE 2. Sparsification: suppose the optimal solution $L_{i,j}$ is obtained
from the optimal solutions $L_{i,k}$, $L_{k+1,p}$ and $L_{p+1,j}$. Based on the recursions
of the secondary structures, $L_{i,k}$ and $L_{k+1,p}$ produce an optimal solution
of $L_{i,p}$. Similarly, $L_{k+1,p}$ and $L_{p+1,j}$ produce an optimal solution of $L_{k+1,j}$.
Now, in order to obtain an optimal solution of $L_{i,j}$ it is sufficient to consider
either the grouping $L_{i,p}$ and $L_{p+1,j}$ or $L_{i,k}$ and $L_{k+1,j}$.

of the folding of RNA secondary structures, sparsification reduces the DP-recursion 
paths to be based on so called candidates. A candidate is in this case an interval, for 
which the optimal solution cannot be written as a sum of optimal solutions of sub- 
intervals. Tracing back these candidates gives rise to irreducible structures and the 
crucial observation is here that these irreducibles appear only at a low frequency. 
This means that only relatively few candidates have to be considered,
which in turn implies a significant reduction in time and space complexity. 
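
The candidate idea can be illustrated on a much simpler objective than the loop-based energy used in this paper. The following is a minimal sketch, assuming a Nussinov-style model that merely maximizes the number of base pairs; the names `max_pairs` and `min_dist` are hypothetical and are not taken from mfold or the Vienna RNA Package. An interval is recorded as a candidate only if closing it with an arc is strictly better than every decomposition into subintervals, and the inner loop then runs over candidates only. For readability the sketch keeps full matrices, so it demonstrates the reduction in time but not in space.

```python
# Watson-Crick and wobble base pairs, as in the text.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_pairs(seq, min_dist=4, sparse=True):
    """Return (maximum number of base pairs, number of candidates found).

    W[i][j] is the optimum for the half-open interval seq[i:j].
    closed[i][j] is the optimum when i pairs with j-1 (-1 if that is impossible).
    """
    n = len(seq)
    W = [[0] * (n + 1) for _ in range(n + 1)]
    closed = [[-1] * (n + 1) for _ in range(n + 1)]
    cand = [[] for _ in range(n + 1)]  # cand[j]: starts k of candidate intervals [k, j)
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            # Best decomposition in which i is NOT paired with j-1.
            split = W[i][j - 1]  # position j-1 stays unpaired
            starts = cand[j] if sparse else range(i + 1, j)
            for k in starts:  # position j-1 is paired with some k > i
                if closed[k][j] >= 0:
                    split = max(split, W[i][k] + closed[k][j])
            # Value of the irreducible solution in which i pairs with j-1.
            if (j - 1) - i >= min_dist and (seq[i], seq[j - 1]) in PAIRS:
                closed[i][j] = 1 + W[i + 1][j - 1]
            # [i, j) is a candidate iff closing it beats every decomposition.
            if closed[i][j] > split:
                cand[j].append(i)
            W[i][j] = max(split, closed[i][j])
    return W[0][n], sum(len(c) for c in cand)
```

Calling `max_pairs(seq, sparse=False)` runs the unpruned recursion on the same input, which gives a direct way to check that both variants return the same optimum while the sparse one visits fewer split points.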

In this paper we study the effect of the particular choice of energy parameters on 
sparsification. In other words, we study what happens to RNA structures of a 
certain energy and quantify the effect of sparsification of their mfe-folding. We shall
see that this energy filtration of secondary structures can have a dramatic effect on
the structures, with significant implications for the effect of sparsification.

The energy parameters are associated with the loops of a secondary structure; they are
empirically measured enthalpic and entropic terms that depend on loop sequence,
length and type [13], [10]. We shall restrict ourselves to a simplified notion of energy
that does not take into account the specifics of nucleotides but only depends on the
combinatorial representation of the secondary structure. For instance, the free energy
of a hairpin loop $H$, $G_H$, is given by

(1.1)  $G_H = \alpha_1 + \alpha_2 \ell_H + \alpha_3$ and $Q_H = v^{G_H}$,

$\alpha_1$ being the penalty for forming $H$, $\alpha_2$ the penalty for an unpaired base, $\alpha_3$ the
score associated to a tetra-loop, $\ell_H$ denoting the number of unpaired bases and $Q_H$
being the weight of $H$. The other two loop-types are treated along these lines.

Equipped with this notion of "combinatorial" energy we study the energy filtration of 
RNA secondary structures. In light of the above discussion about the candidates and 
sparsification we pay particular attention to irreducible RNA secondary structures. 
One key question here being: under which conditions does the probability of finding 
an irreducible minimum-free energy structure tend to zero for larger and larger 
sequences? 

The main results of this paper are as follows: we will show that

• for energy-parameters mimicking the established Turner energy model [10] 
sparsification implies a time and space reduction by a constant factor, 

• there exist energy-parameters close to those of the Turner energy model [10] 
for which sparsification implies a linear time reduction, 

• the effect of sparsification is closely connected to the distribution of irre- 
ducibles within secondary structures. To be precise we prove that this dis- 
tribution undergoes a phase transition from a discrete limit law to a central 
limit law, 

• the limit distribution of irreducibles at the phase transition is a Rayleigh law
and a DP-folding in this regime experiences a linear reduction in space and
time.

2. Some basic facts 

As mentioned above, we present an RNA secondary structure as a diagram by 
drawing its backbone as a horizontal line containing vertices corresponding to the 
labels of the nucleotides and each Watson-Crick base pair as an arc or chord in the 
upper halfplane. Consequently, the diagram representation ignores the particular 
type of nucleotides. Any pair of nucleotides may be placed at the two positions $i,j$ incident to
an arc as long as it is compatible with the Watson-Crick and G-U base pairing rules.
The length of a chord $(i,j)$ is $\ell = j - i$ and in this paper, we only consider RNA
secondary structures with chord length $\ell \geq 4$, see Fig. 3. A chord connecting the first and
last vertices is called a rainbow and an RNA secondary structure exhibiting a rainbow
is an irreducible RNA secondary structure. This notion of irreducibility comes
up naturally when one decomposes a structure by means of cutting the backbone in
two positions without breaking any arcs. Let $\mathcal{S}(n)$ and $\mathcal{C}(n)$ denote the collections
of all linear chord diagrams of RNA secondary structures and irreducible secondary
structures on $n$ vertices, respectively. Let $s(n)$ and $c(n)$ denote the cardinalities of
these sets. Furthermore, let $\mathbb{S} = \bigcup_{n \geq 0} \mathcal{S}(n)$ and $\mathbb{C} = \bigcup_{n \geq 0} \mathcal{C}(n)$ denote the sets
of all linear chord diagrams of RNA secondary structures and irreducible secondary
structures, and let $s \in \mathbb{S}$ and $c \in \mathbb{C}$ denote an $\mathbb{S}$- or $\mathbb{C}$-structure.

FIGURE 3. A diagram representation of a secondary structure with chord length $\ell \geq 4$.

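Let $\mathcal{S}(n,j,i) \supseteq \mathcal{C}(n,j,i)$ denote the collections of RNA secondary structures and
irreducible RNA secondary structures on $n \geq 0$ vertices having length $n$, energy $j$
and weight $i$, and let $s(n,j,i)$ and $c(n,j,i)$ denote their cardinalities; then

(2.1)  $S(z,v,p) = \sum_{n \geq 0} \sum_{j} \sum_{i \geq 0} s(n,j,i)\, z^n v^j p^i$,
       $C(z,v,p) = \sum_{n \geq 0} \sum_{j} \sum_{i \geq 0} c(n,j,i)\, z^n v^j p^i$.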

Thermodynamic models for nucleic acid secondary structure are based on a decomposition
of the base-pairing diagram of structures into distinct loops that are
associated with empirically measured enthalpic and entropic terms that depend on
loop sequence, length and type [13], [10]. To obtain a better idea about these loops
we give the diagram representations of three types of loops $L$:

• a hairpin-loop ($H$) consists of a chord $(i,j)$ with a sequence of unpaired bases
$[i+1, j-1]$. In particular, we have the restriction $j - i \geq 4$; the case $j - i = 4$
is referred to as a tetra-loop;

• an interior-loop ($I$) consists of two base pairs $(i,j)$ and $(i_1,j_1)$, where
$i < i_1 < j_1 < j$, and the sequences of unpaired bases $[i+1, i_1-1]$ and $[j_1+1, j-1]$.
The case of $i + 1 = i_1$ and $j_1 + 1 = j$ is referred to as a helix;

• a multi-loop ($M$) is a sequence

$\bigl((i,j), (i_1,j_1), \ldots, (i_k,j_k), [i+1, i_1-1], [j_{h-1}+1, i_h-1], [j_k+1, j-1]\bigr)$

with sequences of unpaired bases $[j_{h-1}+1, i_h-1]$, for $2 \leq h \leq k$ and $k \geq 2$, see Fig. 4.

The free energy $G_s$ of a secondary structure $s$ is the sum of the energies of its
constituent loops $L$ [10]. Thus the total free energy is given by $G_s = \sum_{L \in s} G_L$.
This notion of energy allows one to compute, by means of dynamic programming
[6], [21], the minimum free energy configurations as well as the partition function [11]
as a weighted sum $Q = \sum_{s \in \mathbb{S}} e^{G_s/RT}$, where $R$ is the universal gas constant, $T$ is the
temperature and $e^{G_s/RT}$ is the weight of sampling a secondary structure $s$ with free
energy $G_s$.

FIGURE 4. The three loop types: hairpin-loop $H$, interior-loop $I$ and multi-loop $M$.

In the following we consider a notion of energy that does not take into account
the specifics of nucleotides and only depends on the combinatorial representation
of the secondary structure. Our energy model is based on seven parameters
$\mathcal{P} = \{\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \gamma_1, \gamma_2\}$ defined as follows:

• Hairpin-loop $H$:

(2.2)  $G_H = \alpha_1 + \alpha_2 \ell_H + \alpha_3$ and $Q_H = v^{G_H}$,

where $G_H$ is the free energy of $H$, $\alpha_1$ is the penalty for forming $H$, $\alpha_2$ is the
penalty for an unpaired base and $\alpha_3$ is a score associated to a tetra-loop.
Furthermore $\ell_H$ denotes the number of unpaired bases and $Q_H$ is the weight
of $H$.

• Interior-loop $I$:

(2.3)  $G_I = \beta_1 + \ell_I \beta_2$ and $Q_I = v^{G_I}$,

where $G_I$ is the free energy of $I$, $\beta_1$ is a favorable bonus of a helix and $\beta_2$ is the
penalty of an unpaired base of $I$. Furthermore $\ell_I$ denotes the total number
of unpaired bases of $I$ and $Q_I$ is the weight of $I$.

• Multi-loop $M$:

(2.4)  $G_M = \gamma_1 + B \gamma_2 + U \times 0$ and $Q_M = v^{G_M}$,

where $G_M$ is the free energy of $M$, $\gamma_1$ is the penalty for the formation of a multiloop,
$B$ is the number of base pairs defining the multiloop (including the closing
pair $i \cdot j$), and $U$ is the number of unpaired bases in the multiloop; $\gamma_2$ is the
penalty for a base pair defining the multiloop and $Q_M$ is the weight of one
$M$.

This energy model induces the partition function

(2.5)  $Q = \sum_{s \in \mathbb{S}} v^{G_s}$,

where $v^{G_s}$ is the weight for sampling a secondary structure with free energy $G_s$.
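
To make the loop decomposition concrete, the following minimal sketch evaluates the combinatorial energy of a structure given in dot-bracket notation. It is an illustration under several assumptions that are not stated in this form in the text: the input is a well-formed dot-bracket string, the exterior loop contributes no energy, the tetra-loop is taken to be the hairpin with four unpaired bases and receives $\alpha_1 + \alpha_3$ (as suggested by the $z^4$ term in the recursion of Theorem 1), and the function names `loop_energy` and `weight` are hypothetical.

```python
V = 1.843868184  # value of v quoted in the text


def loop_energy(structure, params):
    """Sum of the loop energies (2.2)-(2.4) of a dot-bracket structure."""
    a1, a2, a3, b1, b2, g1, g2 = params
    stack = []  # one entry [unpaired bases, child arcs] per open arc
    total = 0.0
    for ch in structure:
        if ch == "(":
            stack.append([0, 0])
        elif ch == ")":
            unpaired, children = stack.pop()
            if children == 0:      # hairpin loop (assumed tetra-loop rule)
                total += a1 + (a3 if unpaired == 4 else a2 * unpaired)
            elif children == 1:    # interior loop, a helix if unpaired == 0
                total += b1 + b2 * unpaired
            else:                  # multi-loop, B counts the closing pair
                total += g1 + g2 * (children + 1)
            if stack:
                stack[-1][1] += 1  # this arc is a child of the enclosing loop
        elif stack:
            stack[-1][0] += 1      # unpaired base inside the current loop
    return total


def weight(structure, params, v=V):
    """Weight v^{G_s} of a structure, as in the partition function (2.5)."""
    return v ** loop_energy(structure, params)


SUBCRITICAL = (-5, -0.01, 7.53, 4, -1, -3.4, -0.6)  # first row of Table 1
```

For instance, `loop_energy("(((...)))", SUBCRITICAL)` adds one hairpin with three unpaired bases and two helix terms.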



3. Energy filtration 



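Theorem 1. The trivariate generating function $C(z,v,p)$ counting irreducible RNA secondary
structures having length $n$, energy $i$ and $j$ arcs, satisfies the following recursion:

(3.1)  $C(z,v,p) = p\, v^{\alpha_1} z^2 \left( \frac{z^3 v^{3\alpha_2}}{1 - z v^{\alpha_2}} + z^4 v^{\alpha_3} - z^4 v^{4\alpha_2} \right)
        + \frac{p\, z^2 v^{\beta_1}}{(1 - z v^{\beta_2})^2}\, C(z,v,p)
        + \frac{p\, z^2 v^{\gamma_1 + \gamma_2} \bigl( C(z,v,p)\, v^{\gamma_2} \bigr)^2}{(1-z)^3 - C(z,v,p)\, v^{\gamma_2} (1-z)^2}$.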

The proof is a straightforward exercise in symbolic methods and will consequently 
be omitted. 

We shall pass from the trivariate generating function to univariate ones by specifying
the indeterminates $v$ and $p$ as follows: since there are a total of four nucleotides A,
G, U and C and only six base pairs, namely GC, CG, AU, UA, UG and GU,
we set $p = \frac{6}{16}$. This choice reflects the probability of randomly forming a (valid) base
pair. We proceed similarly for the indeterminate $v$, setting $v = e^{\frac{1}{RT}} \approx 1.843868184$.

These two choices induce the univariate generating functions $C^*(z) = \sum_{n \geq 0} c^*(n) z^n$,
$S^*(z) = \sum_{n \geq 0} s^*(n) z^n$ and the sets $\mathbb{S}^*$ and $\mathbb{C}^*$, which are the weighted sets of $\mathbb{S}$ and $\mathbb{C}$,
respectively. Furthermore, $c^*(n)$ and $s^*(n)$ denote the sums of the energy weights of
all the $\mathbb{C}^*$-structures and $\mathbb{S}^*$-structures of length $n$, respectively.

We have $c^*(n) = 0$, $s^*(n) = 0$, for $n = 0, \ldots, 4$; and $c^*(n) > 0$, $s^*(n) > 0$, for $n \geq 5$,
see Claim 0 of Lemma 2.

Symbolic methods immediately imply 



Theorem 2. The generating function $S^*(z)$ is given by

(3.2)  $S^*(z) = \frac{1}{1 - (z + C^*(z))}$

and furthermore, the bivariate generating function $S^*(z,t)$ is

(3.3)  $S^*(z,t) = \frac{1}{1 - (z + t\, C^*(z))}$.



The key for the following analysis is to study the dominant singularities of C*(z) and 
S*(z). It is their relative location and behavior that is responsible for the observed 
limit distributions as well as the effect of sparsification. 
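
Before turning to the singularity analysis, the coefficients themselves can be inspected numerically. The sketch below is an illustration only: it assumes the recursion of Theorem 1 with $p = 6/16$ and $v \approx 1.843868184$, iterates it on truncated power series to obtain $c^*(n)$, and then inverts $1 - (z + C^*(z))$ as in Theorem 2 to obtain $s^*(n)$. The helper names (`mul`, `inv`, `irreducible_weights`, `structure_weights`) are hypothetical, and plain floating point arithmetic is used, so very large $n$ will eventually overflow.

```python
V = 1.843868184  # value of v quoted in the text
P = 6 / 16       # probability of a valid base pair


def mul(a, b, N):
    """Product of two power series, truncated after z^N."""
    out = [0.0] * (N + 1)
    for i, x in enumerate(a[: N + 1]):
        if x:
            for j, y in enumerate(b[: N + 1 - i]):
                out[i + j] += x * y
    return out


def inv(a, N):
    """Reciprocal of a power series with a[0] != 0, truncated after z^N."""
    r = [1.0 / a[0]] + [0.0] * N
    for n in range(1, N + 1):
        r[n] = -sum(a[k] * r[n - k] for k in range(1, n + 1)) / a[0]
    return r


def irreducible_weights(params, N, p=P, v=V):
    """Coefficients c*(0..N) of C*(z), by fixed-point iteration of (3.1)."""
    a1, a2, a3, b1, b2, g1, g2 = params
    pad = lambda s: (list(s) + [0.0] * (N + 1))[: N + 1]
    # Hairpins: closing arc z^2, at least 3 unpaired bases, tetra-loop term at z^6.
    A = [0.0] * (N + 1)
    for n in range(5, N + 1):
        A[n] = p * v ** (a1 + a2 * (n - 2))
    if N >= 6:
        A[6] = p * v ** (a1 + a3)
    # Interior loops: p z^2 v^b1 / (1 - z v^b2)^2.
    B = pad([0.0, 0.0] + [p * v ** b1 * (m + 1) * v ** (b2 * m) for m in range(N)])
    sq, cu = pad([1.0, -2.0, 1.0]), pad([1.0, -3.0, 3.0, -1.0])  # (1-z)^2, (1-z)^3
    C = [0.0] * (N + 1)
    for _ in range(N):  # every pass fixes at least two further coefficients
        Cg = [v ** g2 * c for c in C]
        denom = [x - y for x, y in zip(cu, mul(Cg, sq, N))]
        multi = mul(mul(Cg, Cg, N), inv(denom, N), N)
        multi = pad([0.0, 0.0] + [p * v ** (g1 + g2) * m for m in multi])
        BC = mul(B, C, N)
        C = [A[n] + BC[n] + multi[n] for n in range(N + 1)]
    return C


def structure_weights(c, N):
    """Coefficients s*(0..N) of S*(z) = 1 / (1 - (z + C*(z)))."""
    denom = pad_neg = [-x for x in (list(c) + [0.0] * (N + 1))[: N + 1]]
    denom[0] += 1.0
    if N >= 1:
        denom[1] -= 1.0
    return inv(denom, N)


SUBCRITICAL = (-5, -0.01, 7.53, 4, -1, -3.4, -0.6)  # first row of Table 1
c = irreducible_weights(SUBCRITICAL, 40)
s = structure_weights(c, 40)
ratios = [c[n] / s[n] for n in (20, 30, 40)]  # fraction of irreducible weight
```

Comparing such ratios for different parameter sets gives a quick, informal view of how often a weighted structure of length $n$ is irreducible, which is the quantity that governs the number of candidates.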

We begin our analysis by making the following observations. Theorem 1 implies

(3.4)  $w_2^*(z)\, C^*(z)^2 + w_1^*(z)\, C^*(z) + w_0^*(z) = 0$,

where

$w_2^*(z) = \bigl( 16 v^{\gamma_2} (1-z)^2 + 6 z^2 v^{\gamma_1 + 3\gamma_2} \bigr) (1 - z v^{\alpha_2}) (1 - z v^{\beta_2})^2
            - 6 z^2 v^{\beta_1 + \gamma_2} (1-z)^2 (1 - z v^{\alpha_2})$,

$w_1^*(z) = 6 z^2 v^{\beta_1} (1-z)^3 (1 - z v^{\alpha_2}) - 6 v^{\alpha_1 + \gamma_2 + 3\alpha_2} z^5 (1 - z v^{\beta_2})^2 (1-z)^2
            + 6 z^6 v^{\alpha_1 + \gamma_2} (v^{4\alpha_2} - v^{\alpha_3}) (1-z)^2 (1 - z v^{\alpha_2}) (1 - z v^{\beta_2})^2
            - 16 (1-z)^3 (1 - z v^{\alpha_2}) (1 - z v^{\beta_2})^2$,

$w_0^*(z) = 6 v^{\alpha_1} z^5 (1-z)^3 (1 - z v^{\beta_2})^2 \bigl( v^{3\alpha_2} - z (v^{4\alpha_2} - v^{\alpha_3}) (1 - z v^{\alpha_2}) \bigr)$.
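
The step from the recursion of Theorem 1 to the quadratic equation can be checked by hand. The following worked computation is a sketch with $p = \frac{6}{16}$; the abbreviations $A$, $B$ and $M$ are introduced here only for this purpose and do not appear elsewhere in the paper.

```latex
\begin{align*}
C^* &= A + B\,C^* + \frac{M\,(C^*)^2}{(1-z)^2\bigl((1-z) - v^{\gamma_2} C^*\bigr)}, \\
A &= \tfrac{6}{16}\, v^{\alpha_1} z^2 \Bigl( \tfrac{z^3 v^{3\alpha_2}}{1 - z v^{\alpha_2}}
      + z^4 (v^{\alpha_3} - v^{4\alpha_2}) \Bigr), \quad
B = \tfrac{6}{16}\, \tfrac{z^2 v^{\beta_1}}{(1 - z v^{\beta_2})^2}, \quad
M = \tfrac{6}{16}\, z^2 v^{\gamma_1 + 3\gamma_2}. \\
\intertext{Clearing the denominator and collecting powers of $C^*$ gives}
0 &= -\bigl[(1-B)\, v^{\gamma_2} (1-z)^2 + M\bigr] (C^*)^2
     + \bigl[(1-B)(1-z)^3 + A\, v^{\gamma_2} (1-z)^2\bigr] C^*
     - A (1-z)^3 .
\intertext{Multiplying by $-16\,(1 - z v^{\alpha_2})(1 - z v^{\beta_2})^2$ yields
$w_2^*(z)\,(C^*)^2 + w_1^*(z)\,C^* + w_0^*(z) = 0$ with}
w_2^*(z) &= 16\,(1 - z v^{\alpha_2})(1 - z v^{\beta_2})^2 \bigl[(1-B)\, v^{\gamma_2} (1-z)^2 + M\bigr], \\
w_1^*(z) &= -16\,(1 - z v^{\alpha_2})(1 - z v^{\beta_2})^2 \bigl[(1-B)(1-z)^3 + A\, v^{\gamma_2} (1-z)^2\bigr], \\
w_0^*(z) &= 16\,(1 - z v^{\alpha_2})(1 - z v^{\beta_2})^2\, A\,(1-z)^3 .
\end{align*}
```

Expanding $A$, $B$ and $M$ in these three expressions reproduces the polynomial coefficients of the quadratic term by term; in particular the factor $z^5 (1-z)^3$ of $w_0^*(z)$ comes directly from $A (1-z)^3$.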

We observe that only one solution of eq. (3.4) has the property $c^*(n) \geq 0$, namely

(3.5)  $C^*(z) = \frac{N^*(z)}{D^*(z)}$,

where $N^*(z) = -w_1^*(z) + \sqrt{w_1^*(z)^2 - 4 w_2^*(z) w_0^*(z)}$ and $D^*(z) = 2 w_2^*(z)$.

Let $\rho_{c^*}$ denote the radius of convergence of $C^*(z)$ and $\rho_{s^*}$ denote the radius of
convergence of $S^*(z)$. Then, clearly, $0 < \rho_{c^*} < 1$ and in view of $\mathbb{C}^* \subset \mathbb{S}^*$, we have
$0 < \rho_{s^*} \leq \rho_{c^*} < 1$.

• Let furthermore $\rho_r$ denote the minimum positive real root of odd order of the
discriminant

$w_1^*(z)^2 - 4 w_2^*(z) w_0^*(z) = 0$.

• Let $\rho_d$ and $\rho_p$ denote the minimum positive real roots of the equations

$w_2^*(z) = 0$ and $1 - (z + C^*(z)) = 0$.

Theorem 3. (Pringsheim's theorem) [15]: If f(z) is representable at the origin
by a series expansion that has non-negative coefficients and radius of convergence r, 
then the point z = r is a dominant singularity of f(z). 

Lemma 1. For the positive dominant real singularities of $C^*(z)$ and $S^*(z)$, $\rho_{c^*}$ and
$\rho_{s^*}$, we distinguish the following three cases:

(I): $\rho_{c^*} = \rho_r$ and $\rho_{s^*} = \rho_{c^*}$;

(II): $\rho_{c^*} = \rho_r$ and $\rho_{s^*} = \rho_p$, where $\rho_{s^*} < \rho_{c^*}$;

(III): $\rho_{c^*} = \rho_r$ and $\rho_{s^*} = \rho_p$, where $\rho_{s^*} = \rho_{c^*}$.



Proof. According to Theorem 3 we have $\rho_{c^*} = \min\{\rho_r, \rho_d\}$.
Claim 0: We have $\rho_d \neq \rho_r$.

To prove Claim 0, we begin by observing that $v^{4\alpha_2} - v^{\alpha_3} < 0$, since $\alpha_2 < 0$, $\alpha_3 > 0$
and $v > 1$. Then, for arbitrary real $z$, $0 < z < 1$,

$w_0^*(z) = 6 v^{\alpha_1} z^5 (1-z)^3 (1 - z v^{\beta_2})^2 \bigl( \underbrace{v^{3\alpha_2}}_{>0} - z \underbrace{(v^{4\alpha_2} - v^{\alpha_3})}_{<0} \underbrace{(1 - z v^{\alpha_2})}_{>0} \bigr) > 0$.

Suppose $\rho_d = \rho_r$. Then, in view of eq. (3.5),

$w_2^*(\rho_d) = 0$ and $w_1^*(\rho_d)^2 - 4 w_2^*(\rho_d) w_0^*(\rho_d) = 0$

and consequently $w_1^*(\rho_d) = 0$. But in this case eq. (3.4) implies $w_0^*(\rho_d) = 0$, which
conflicts with $w_0^*(\rho_d) > 0$, and Claim 0 is proved.

First, we discuss the case $\rho_d < \rho_r$.

Claim 1: $\rho_d$ is a removable singularity of $C^*(z)$.

To prove Claim 1, suppose first $N^*(\rho_d) \neq 0$. Since $D^*(\rho_d) = 0$, we have $\lim_{z \to \rho_d} C^*(z) = \infty$,
so $\rho_d$ is accordingly a pole of $C^*(z)$.
In view of

$S^*(z) = \frac{1}{1 - (z + C^*(z))}$

we have $\lim_{z \to \rho_d} S^*(z) = 0$. This is impossible, since $S^*(z) = \sum_{n \geq 0} s^*(n) z^n$ and the only
value of $\rho_d$ to make $\lim_{z \to \rho_d} S^*(z) = 0$ is $\rho_d = 0$.

Thus $N^*(\rho_d) = 0$ holds and we compute, using $D^*(\rho_d) = 2 w_2^*(\rho_d) = 0$,

$N^*(\rho_d) = -w_1^*(\rho_d) + \sqrt{w_1^*(\rho_d)^2 - 4 w_2^*(\rho_d) w_0^*(\rho_d)}
            = -w_1^*(\rho_d) + \mathrm{csgn}(w_1^*(\rho_d)) \cdot w_1^*(\rho_d) = 0$.

Consequently, $\mathrm{csgn}(w_1^*(\rho_d))$ is $+1$ and we have

$\lim_{z \to \rho_d} C^*(z) = \lim_{z \to \rho_d} \frac{2 w_0^*(z)\, w_2^*(z)}{\bigl( -w_1^*(z) - \sqrt{w_1^*(z)^2 - 4 w_2^*(z) w_0^*(z)} \bigr)\, w_2^*(z)}
                           = \frac{2 w_0^*(\rho_d)}{-w_1^*(\rho_d) - w_1^*(\rho_d)}$.

Since $w_2^*(\rho_d) = 0$ and $\rho_d \neq \rho_r$, we have $w_1^*(\rho_d)^2 - 4 w_2^*(\rho_d) w_0^*(\rho_d) \neq 0$, whence $w_1^*(\rho_d) \neq 0$.
Using our previous observation $w_0^*(\rho_d) > 0$, we conclude

$\lim_{z \to \rho_d} C^*(z) = -\frac{w_0^*(\rho_d)}{w_1^*(\rho_d)} \neq 0, \infty$,

which implies that $\rho_d$ is a removable singularity, whence Claim 1.

Claim 2: $z = \rho_d e^{i\theta}$, where $0 < \theta < 2\pi$, is not a dominant singularity of $C^*(z)$.
To prove Claim 2 we assume a contrario that $z = \rho_d e^{i\theta}$, $0 < \theta < 2\pi$, is a dominant
singularity of $C^*(z)$. Then the convergence radius of $C^*(z)$ is $\rho_d$ and Theorem 3
implies that $\rho_d$ is also a dominant singularity of $C^*(z)$, which contradicts Claim 1,
where we showed that $\rho_d$ is a removable singularity.

Therefore, in case of $\rho_d < \rho_r$, $\rho_{c^*} \neq \rho_d$. Consequently, Claim 1 and Claim 2 imply

$\rho_{c^*} = \rho_r$.

We thus have the following two scenarios for $\rho_{s^*}$:

(3.6)  $\rho_{s^*} = \rho_r$ for $\rho_r < \rho_p$, and $\rho_{s^*} = \rho_p$ for $\rho_p \leq \rho_r$,

and obtain the three cases:

(I): $\rho_{c^*} = \rho_r$ and $\rho_{s^*} = \rho_r$;

(II): $\rho_{c^*} = \rho_r$ and $\rho_{s^*} = \rho_p$ with $\rho_p < \rho_r$;

(III): $\rho_{c^*} = \rho_r$ and $\rho_{s^*} = \rho_p$ with $\rho_p = \rho_r$

and Lemma 1 follows. □

Lemma 2. The dominant singularity of $C^*(z)$, $\rho_{c^*}$, is unique.

Proof. Claim 0: $C^*(z)$ is aperiodic.

It is clear that there always exists some irreducible secondary structure of length $n \geq 5$.
The coefficients of $C^*(z)$ are weighted sums of these structures and the weights
are, by construction, always strictly positive. Therefore, $C^*(z) = \sum_{n \geq 0} c^*(n) z^n$ has
for $n \geq 5$ always strictly positive coefficients, i.e. $c^*(n) > 0$. Hence, there exist three
indices $i < j < k$ such that $c^*(i) c^*(j) c^*(k) \neq 0$ and $\gcd(j - i, k - i) = 1$; therefore
$C^*(z)$ is aperiodic and Claim 0 is proved.

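In view of eq. (3.1), we compute

(3.7)  $C(z,v,p) = p\, v^{\alpha_1} z^2 \Bigl( (z v^{\alpha_2})^3 \sum_{i \geq 0} (z v^{\alpha_2})^i + (v^{\alpha_3} - v^{4\alpha_2}) z^4 \Bigr)
        + p\, z^2 v^{\beta_1}\, C(z,v,p) \Bigl( \sum_{j \geq 0} (z v^{\beta_2})^j \Bigr)^2
        + p\, v^{\gamma_1} (z^2 v^{\gamma_2}) \bigl( C(z,v,p)\, v^{\gamma_2} \bigr)^2 \Bigl( \sum_{k \geq 0} z^k \Bigr)^3 \sum_{l \geq 0} \Bigl( C(z,v,p)\, v^{\gamma_2} \sum_{t \geq 0} z^t \Bigr)^l$.

Setting $p = \frac{6}{16}$, $v = e^{\frac{1}{RT}}$ and $w = C^*(z) = \sum_{n \geq 0} c^*(n) z^n$ we obtain a power series
equation

(3.8)  $w = G(z, w)$,

where $G(z,w) = \sum_{m,n \geq 0} g_{m,n} z^m w^n$ and $g_{m,n} \geq 0$. Indeed, since $\alpha_3 > 0$ and $\alpha_2 < 0$,
we have $v^{4\alpha_2} < v^{\alpha_3}$ and $v^{\alpha_3} - v^{4\alpha_2} > 0$. The other coefficients in eq. (3.7) are all
positive, i.e. $v^{\gamma} > 0$, $\gamma \in \mathcal{P} = \{\alpha_1, \alpha_2, \alpha_3, \beta_1, \beta_2, \gamma_1, \gamma_2\}$, implying $g_{m,n} \geq 0$; in particular,
$g_{0,1} = 0$.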

Furthermore, $G(z,w)$ is a bivariate power series which is absolutely convergent in a
domain $\mathcal{D}$, such that $|z| < R_1$, $|w| < R_2$. According to Theorem 9.4.4 of [5], there
exists a unique function analytic in a neighborhood $|z| < \rho$ of $z = 0$, where $\rho < R_1$,
such that

$C^*(0) = 0$ and $G(z, C^*(z)) - C^*(z) = 0$, for $|z| < \rho$, $|C^*(z)| < R_2$.

Furthermore Theorem 9.4.6 of [5] shows that the radius of convergence, $\rho = \rho_{c^*}$,
of the solution of eq. (3.8), $C^*(z)$, and the value $w_\rho = \lim_{z \to \rho} C^*(z)$ satisfy the
equations

$w_\rho = G(\rho, w_\rho)$ and $1 = G_w(\rho, w_\rho)$.

According to Lemma 1 we have $\rho_{c^*} = \rho_r$ and thus

(3.9)  $w_\rho = \lim_{z \to \rho_r} C^*(z) = C^*(\rho_r) = \frac{-w_1^*(\rho_r)}{2 w_2^*(\rho_r)} \neq 0, +\infty$.



We next show that $C^*(z)$ has no other dominant singularities than $\rho_r$.
To this end we note that $C^*(z)$ converges at the point $z = \rho_r$. Since $c^*(n) \geq 0$, applying
the triangle inequality, we have $|C^*(\rho_r e^{i\theta})| \leq C^*(\rho_r)$. Therefore $C^*(z)$ converges
on the whole circle $|z| = \rho_r$. Since $C^*(z)$ is aperiodic and $C^*(z) = \sum_{n \geq 0} c^*(n) z^n$ is a
convergent power series for any $z$ with $|z| = \rho_r$, the Daffodil Lemma of [3] implies

$|C^*(\rho_r e^{i\theta})| < C^*(\rho_r)$.

Taking the derivative in eq. (3.8), we derive

$\frac{d}{dz} C^*(z) = \frac{\partial}{\partial w} G(z, C^*(z))\, \frac{d}{dz} C^*(z) + \frac{\partial}{\partial z} G(z, C^*(z))$.

Thus we have

(3.10)  $w_z = \frac{G_z(z,w)}{1 - G_w(z,w)}$,

which implies that $C^*(z)$ is indeed analytic as long as $G_w(z,w) \neq 1$.

Since $G(z,w)$ has non-negative coefficients, it is monotonically increasing and so is
$G_w(z,w)$. Therefore, for points $\rho_r e^{i\theta}$, where $0 < \theta < 2\pi$, we have

$|G_w(\rho_r e^{i\theta}, C^*(\rho_r e^{i\theta}))| \leq |G_w(|\rho_r e^{i\theta}|, |C^*(\rho_r e^{i\theta})|)| < |G_w(\rho_r, C^*(\rho_r))| = 1$,

which implies that $C^*(z)$ is analytic on the whole circle $|z| = \rho_r$ except at $\rho_r$. This
proves that $\rho_{c^*} = \rho_r$ is unique. □

Lemma 3. The dominant singularity of $S^*(z)$ is unique and there are the following
three cases:

(I): If $\rho_{s^*} = \rho_{c^*} = \rho_r$, then $\rho_r$ is the unique dominant singularity of $S^*(z)$;

(II): If $\rho_{s^*} = \rho_p < \rho_r$, then $\rho_p$ is the unique dominant singularity of $S^*(z)$;

(III): If $\rho_{s^*} = \rho_p = \rho_r$, then $\rho_p$ is the unique dominant singularity of $S^*(z)$.
Furthermore, in case of (II) and (III), the degree of the root $\rho_p$ of the equation
$1 - (z + C^*(z)) = 0$ is exactly 1.



Proof. According to Lemma 1, we need to distinguish the cases (I), (II) and (III).

(I): here $\rho_r$ is the unique dominant singularity of $C^*(z)$. Then, in view of eq. (3.2),
the dominant singularity of $S^*(z)$ comes from the circle $|z| = \rho_r$, whence $\rho_r$ is also
the unique dominant singularity of $S^*(z)$.

(II): in this case we have $\rho_{s^*} = \rho_p < \rho_{c^*}$ and the dominant singularities of $S^*(z)$ all
come from the circle $|z| = \rho_p$. We have

$S^*(z)^{-1} = 1 - (z + C^*(z))$

and substituting $z = \rho_p$, we derive

$\rho_p + C^*(\rho_p) = 1$.

Then $\mathbf{D}^*(z) = z + C^*(z) = \sum_{n \geq 0} d^*(n) z^n$ is an aperiodic power series with
non-negative coefficients. According to Lemma 1, we have

(3.11)  $\mathbf{D}^*(\rho_p) = \rho_p + C^*(\rho_p) = 1$.

Since $d^*(n) \geq 0$, applying the triangle inequality implies $|\mathbf{D}^*(\rho_p e^{i\theta})| \leq \mathbf{D}^*(\rho_p)$ and
$\mathbf{D}^*(z)$ converges on the whole circle $|z| = \rho_p$. Since $\mathbf{D}^*(z)$ is aperiodic, the Daffodil
Lemma of [3] guarantees

$|\mathbf{D}^*(\rho_p e^{i\theta})| < \mathbf{D}^*(\rho_p) \leq 1$.

Therefore, $\mathbf{D}^*(\rho_p e^{i\theta}) \neq 1$ for $0 < \theta < 2\pi$, whence the points on the circle $|z| = \rho_p$
other than the point $\rho_p$ are not singularities of $S^*(z)$.

(III): here the dominant singularities of $S^*(z)$ all come from the circle $|z| = \rho_p = \rho_r$.
According to (I) and (II), $\rho_p = \rho_r$ is the unique dominant singularity of $S^*(z)$.

In case of (II) and (III) and in view of eq. (3.2), the denominator of $S^*(z)$ is

(3.12)  $1 - (z + C^*(z))$

and $1 - (\rho_p + C^*(\rho_p)) = 0$. Taking the derivative of eq. (3.12), we obtain
$-\bigl(1 + \frac{d}{dz} C^*(z)\bigr)$. Since $C^*(z) = \sum_{n \geq 0} c^*(n) z^n$, where $c^*(n) > 0$ for $n \geq 5$, the
derivative $\frac{d}{dz} C^*(z) = \sum_{n \geq 1} n\, c^*(n) z^{n-1}$ has also nonnegative coefficients. As a
result we have $\frac{d}{dz} C^*(\rho_p) > 0$ and consequently $\bigl(1 + \frac{d}{dz} C^*(\rho_p)\bigr) > 0$. We conclude that
$\rho_p$ is not a multiple root of eq. (3.12), which completes the proof of Lemma 3. □



4. The main result 

We consider the general composition scheme

(4.1)  $\mathcal{F} = \mathcal{G} \circ (u \mathcal{H}) \implies F(z,u) = g(u\, h(z))$.

Assume that $g$ and $h$ have non-negative coefficients and that $h(0) = 0$, so that the
composition $g(h(z))$ is well-defined. We let $\rho_g$ and $\rho_h$ denote the radii of convergence
of $g$ and $h$, and define

(4.2)  $\tau_g = \lim_{x \to \rho_g^-} g(x)$ and $\tau_h = \lim_{x \to \rho_h^-} h(x)$.

Definition 1. [3] The composition scheme $F(z,t) = g(t\, h(z))$ is said to be subcritical
if $\tau_h < \rho_g$, critical if $\tau_h = \rho_g$, and supercritical if $\tau_h > \rho_g$.

We observe that

(4.3)  $S^*(z,t) = \frac{1}{1 - (z + t\, C^*(z))} = \frac{1}{1-z} \cdot \frac{1}{1 - t \frac{C^*(z)}{1-z}} = f(z)\, g(t\, h(z))$,

where $f(z) = \frac{1}{1-z}$, $g(w) = \frac{1}{1-w}$ and $h(z) = \frac{C^*(z)}{1-z}$. Furthermore, we set

$\mathbb{P}(X_n = t) = s^*(n,t)/s^*(n)$.

Since $0 < \rho_{s^*} \leq \rho_{c^*} < 1$, we restrict ourselves to the cases $0 < \rho_d, \rho_r, \rho_p < 1$.
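
The factorization of $S^*(z,t)$ into $f(z)\, g(t\, h(z))$ also gives a direct way to tabulate the distribution of the number of irreducibles for a fixed length $n$, since the weight of structures with exactly $k$ irreducibles is $[z^n] f(z) h(z)^k$. The following minimal sketch assumes that the coefficients of $C^*(z)$ are supplied as a list `c` with `c[0] == 0`; the function name `irreducible_count_distribution` is hypothetical.

```python
def irreducible_count_distribution(c, n):
    """Return [P(X_n = k) for k = 0..n] from the coefficients c of C*(z).

    Uses f(z) = 1/(1-z), g(w) = 1/(1-w) and h(z) = C*(z)/(1-z), so that
    s*(n, k) = [z^n] f(z) h(z)^k.  Requires c[0] == 0.
    """
    c = (list(c) + [0.0] * (n + 1))[: n + 1]
    f = [1.0] * (n + 1)            # coefficients of 1/(1-z)
    h, running = [], 0.0
    for x in c:                    # C*(z)/(1-z): prefix sums of c
        running += x
        h.append(running)
    term = f[:]                    # f(z) * h(z)^0
    weights = []
    for _ in range(n + 1):
        weights.append(term[n])    # s*(n, k) for the current k
        nxt = [0.0] * (n + 1)      # term * h(z), truncated after z^n
        for i, x in enumerate(term):
            if x:
                for j in range(n + 1 - i):
                    nxt[i + j] += x * h[j]
        term = nxt
    total = sum(weights)           # equals s*(n)
    return [w / total for w in weights]
```

As a sanity check on a toy input that is not an RNA model, the series $C(z) = z$ gives $S(z,t) = \sum_n (1+t)^n z^n$, so the function returns the binomial distribution with parameter $1/2$.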

Theorem 4. The distribution of irreducible substructures within energy-filtered RNA
secondary structures has the following distinct regimes:

(a) the subcritical regime: Both dominant singularities, $z = \rho_{c^*}$ of $C^*(z)$ and
$z = \rho_{s^*}$ of $S^*(z)$, are exclusively branch point (square root) singularities. Then
$\mathbb{P}(X_n = t)$ satisfies a discrete limit law and

$\lim_{n \to \infty} \frac{c^*(n)}{s^*(n)} = \chi > 0$;

(b) the supercritical regime: The dominant singularity $z = \rho_{c^*}$ of $C^*(z)$ is
exclusively a branch point singularity; the dominant singularity $z = \rho_{s^*}$ of $S^*(z)$ is
exclusively a pole. Then the probability distribution $\mathbb{P}(X_n = t)$, after standardization,
satisfies a limiting Gaussian distribution and

$\lim_{n \to \infty} \frac{c^*(n)}{s^*(n)} = 0$;

(c) the critical regime: The dominant singularity $z = \rho_{c^*}$ of $C^*(z)$ is exclusively a
branch point singularity; the dominant singularity $z = \rho_{s^*}$ of $S^*(z)$ is simultaneously
a branch point singularity and a pole. Then $\mathbb{P}(X_n = t)$ satisfies a local limit law
whose density is a Rayleigh distribution and

$\frac{c^*(n)}{s^*(n)} \sim \frac{\chi'}{n}$,

where $\chi' > 0$.

Proof. To prove the theorem we shall show how the three scenarios identified in
Lemma 3 give rise to the three regimes.

(a): Let us begin with case (I) of Lemma 3, where we have $\rho_{c^*} = \rho_{s^*} = \rho_r$ and
$\rho_r < \rho_p$ and both dominant singularities, $z = \rho_{c^*}$ of $C^*(z)$ and $z = \rho_{s^*}$ of $S^*(z)$, are
exclusively branch point (square root) singularities.
We set

$t_2(z) = \frac{1}{2 w_2^*(z)} \sqrt{\frac{w_1^*(z)^2 - 4 w_2^*(z) w_0^*(z)}{z - \rho_r}}$,

$t_1(z) = \frac{-w_1^*(z)}{2 w_2^*(z)}$,

and express $C^*(z)$ as

(4.4)  $C^*(z) = t_1(z) + t_2(z) (z - \rho_r)^{\frac{1}{2}}$.

The singular expansion of $C^*(z)$ is obtained from the regular expansion of $t_1(z)$ and
the singular expansion of $(z - \rho_r)^{\frac{1}{2}} t_2(z)$. Consequently,

(4.5)  $C^*(z) = t_1(\rho_r) + t_2(\rho_r) (z - \rho_r)^{\frac{1}{2}} + O(z - \rho_r)$.

In view of $O(z - \rho_r) = o((z - \rho_r)^{\frac{1}{2}})$, Theorem VI.3 [3] implies

(4.6)  $[z^n] C^*(z) \sim t_2(\rho_r)\, [z^n] (z - \rho_r)^{\frac{1}{2}}$.

Using Theorem VI.1 of [3], we obtain

(4.7)  $[z^n] C^*(z) \sim k_1 \cdot n^{-\frac{3}{2}} \cdot (\rho_r)^{-n} \cdot (1 + O(\frac{1}{n}))$, for some constant $k_1$.

Analogously we compute the singular expansion of $S^*(z)$ as:

(4.8)  $S^*(z) = d_1 + d_2 (z - \rho_r)^{\frac{1}{2}} + O(z - \rho_r)$, for some constants $d_1$, $d_2$.

Employing Theorem VI.1 of [3],

(4.9)  $[z^n] S^*(z) \sim k_2 \cdot n^{-\frac{3}{2}} \cdot (\rho_r)^{-n} (1 + O(\frac{1}{n}))$, for some constant $k_2$.

Therefore,

(4.10)  $\lim_{n \to \infty} \frac{c^*(n)}{s^*(n)} = \lim_{n \to \infty} \frac{k_1 \cdot n^{-\frac{3}{2}} \cdot (\rho_r)^{-n}}{k_2 \cdot n^{-\frac{3}{2}} \cdot (\rho_r)^{-n}} = \chi > 0$.
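
The exponent $-\frac{3}{2}$ in these asymptotics comes from the standard coefficient estimate for a square root singularity. As a worked illustration (not part of the original argument, and normalizing the singularity to $1 - z/\rho_r$ so that branch constants are absorbed into $k_1$ and $k_2$):

```latex
\begin{align*}
[z^n]\Bigl(1 - \frac{z}{\rho_r}\Bigr)^{1/2}
  &= \rho_r^{-n}\,[z^n](1-z)^{1/2}
   = -\frac{\rho_r^{-n}}{2n-1}\binom{2n}{n}4^{-n}, \\
\binom{2n}{n}4^{-n} &\sim \frac{1}{\sqrt{\pi n}}
  \quad\Longrightarrow\quad
[z^n]\Bigl(1 - \frac{z}{\rho_r}\Bigr)^{1/2}
  \sim \frac{n^{-3/2}}{\Gamma(-\tfrac12)}\,\rho_r^{-n}
  = -\frac{1}{2\sqrt{\pi}}\; n^{-3/2}\,\rho_r^{-n}.
\end{align*}
```

Both $C^*(z)$ and $S^*(z)$ have this type of singularity in the subcritical case, so the factors $n^{-3/2} \rho_r^{-n}$ cancel in the quotient and only the ratio of the constants survives.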

We now have $\rho_r < \rho_p$; furthermore, eq. (4.3) holds with $\rho_g = 1$ and

$\tau_h = \lim_{z \to \rho_h^-} \frac{C^*(z)}{1 - z}$.

Since $0 < \rho_p < 1$ is a root of $1 - (z + C^*(z)) = 0$, we observe $\rho_p + C^*(\rho_p) = 1$. $C^*(z)$ is
a power series with positive coefficients and thus, as a function over $[0, 1[$, continuous
and monotone. As a result $\rho_r + C^*(\rho_r) < 1$ and $\tau_h = h(\rho_r) = \frac{C^*(\rho_r)}{1 - \rho_r} < 1 = \rho_g$,
i.e. $S^*(z,t)$ is governed by the subcritical paradigm.

According to Proposition IX.1 [3], the quotient of the coefficients $s^*(n,k)$ and $s^*(n)$
satisfies

(4.11)  $\lim_{n \to \infty} \frac{s^*(n,k)}{s^*(n)} = q_k$, where $q_k = \frac{k\, g_k\, \tau_h^{k-1}}{g'(\tau_h)}$.

Therefore the probability generating function of the limit distribution $(q_k)$ is given
by

(4.12)  $q(t) = \frac{t\, g'(\tau_h t)}{g'(\tau_h)}$

and $\mathbb{P}(X_n = t) = \frac{s^*(n,t)}{s^*(n)}$ satisfies a discrete limit law as asserted.



Ad (b): We have the case (II) of Lemma 3, where $\rho_{c^*} = \rho_r$, $\rho_{s^*} = \rho_p$ and $\rho_p < \rho_r$;
the dominant singularity $z = \rho_{c^*}$ of $C^*(z)$ is exclusively a branch point singularity and
the dominant singularity $z = \rho_{s^*}$ of $S^*(z)$ is exclusively a pole.

In analogy to our arguments in (a) we derive

$s^*(n) \sim k_3 \cdot n^{\omega_1 - 1} \cdot (\rho_{s^*})^{-n} \cdot (1 + O(\frac{1}{n}))$, $\omega_1 \geq 1$ and $\omega_1 \in \mathbb{N}$,

$c^*(n) \sim k_1 \cdot n^{-\frac{3}{2}} \cdot (\rho_{c^*})^{-n} \cdot (1 + O(\frac{1}{n}))$,

for constants $k_3$, $k_1$. Since $\rho_p$ is a simple root by Lemma 3, we have $\omega_1 = 1$ and clearly,

(4.13)  $\frac{c^*(n)}{s^*(n)} \sim \frac{k_1 \cdot n^{-\frac{3}{2}} (\rho_{c^*})^{-n}}{k_3 \cdot 1 \cdot (\rho_{s^*})^{-n}} = \frac{k_1}{k_3} \cdot n^{-\frac{3}{2}} \left( \frac{\rho_{s^*}}{\rho_{c^*}} \right)^n \to 0$.

We now have $\rho_{c^*} > \rho_p = \rho_{s^*}$ and since $0 < \rho_p < 1$ as well as $\rho_p + C^*(\rho_p) = 1$,
analogous monotonicity arguments imply $\rho_{c^*} + C^*(\rho_{c^*}) > 1$. As a result

$\tau_h = h(\rho_{c^*}) = \frac{C^*(\rho_{c^*})}{1 - \rho_{c^*}} > 1 = \rho_g$,

i.e. $S^*(z,t)$ is governed by the supercritical paradigm. According to Proposition
IX.6 [3], $\mathbb{P}(X_n = k) = s^*(n,k)/s^*(n)$ satisfies a Gaussian limit law with speed of
convergence $O(1/\sqrt{n})$.

Ad (c): We consider finally case (III) of Lemma 3, where we have $\rho_{c^*} = \rho_r$ and
$\rho_{s^*} = \rho_p$ with $\rho_p = \rho_r$; the dominant singularity $z = \rho_{c^*}$ of $C^*(z)$ is exclusively a
branch point singularity and the dominant singularity $z = \rho_{s^*}$ of $S^*(z)$ is simultaneously
a branch point singularity and a pole.
Then the singular expansion of $S^*(z)$ is given by:

(4.14)  $S^*(z) = d_3 + d_4 (z - \rho_{s^*})^{-\frac{1}{2}} + O(z - \rho_{s^*})$

for constants $d_3$ and $d_4$. We compute

(4.15)  $s^*(n) \sim k_4 \cdot n^{-\frac{1}{2}} (\rho_{s^*})^{-n} \cdot (1 + O(\frac{1}{n}))$

for a constant $k_4$. For $c^*(n)$ we derive the asymptotic expression

(4.16)  $c^*(n) \sim k_1 \cdot n^{-\frac{3}{2}} \cdot (\rho_{c^*})^{-n} \cdot (1 + O(\frac{1}{n}))$.
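
Consequently,

(4.17)  $\frac{c^*(n)}{s^*(n)} \sim \frac{k_1 \cdot n^{-\frac{3}{2}} \cdot (\rho_{c^*})^{-n}}{k_4 \cdot n^{-\frac{1}{2}} \cdot (\rho_{s^*})^{-n}} = \frac{1}{n} \cdot \frac{k_1}{k_4} = \frac{\chi'}{n}$.

In view of $\rho_r = \rho_p$ we obtain $\frac{C^*(\rho_r)}{1 - \rho_r} = 1$ and

$\tau_h = h(\rho_r) = \frac{C^*(\rho_r)}{1 - \rho_r} = 1 = \rho_g$,

whence $S^*(z,t)$ belongs to the critical paradigm. The critical composition is $S^*(z,t) =
f(z)\, g(t\, h(z))$, where $h(z)$ and $g(z)$ have singular exponents $\lambda = \frac{1}{2}$ and $\lambda' = 1$, where
$\lambda < \lambda'$. We proceed by applying Proposition IX.24 of [3], from which we can conclude
that the normalized r.v. $X_n/\sqrt{n}$ satisfies a local limit law whose density is
given by a Rayleigh law [2], [3]. To be specific we have

(4.18)  $\mathbb{P}(X_n = k) = \frac{[w^k] g(w) \cdot [z^n] h(z)^k}{[z^n] g(h(z))}$

and the singular expansions of $g(z)$, $h(z)$ and $g(h(z))$ are respectively given by

$g(z) = (1 - z)^{-1}$,

$h(z) = \tau_h - h_{\frac{1}{2}} \left(1 - \frac{z}{\rho_r}\right)^{\frac{1}{2}} + O\left(1 - \frac{z}{\rho_r}\right)$,

$g(h(z)) = m_{-\frac{1}{2}} \left(1 - \frac{z}{\rho_r}\right)^{-\frac{1}{2}} + O(1)$.

Thus we obtain

(4.19)  $[z^k] g(z) \sim (1)^{-k}$ and $[z^n] g(h(z)) \sim m_{-\frac{1}{2}} \cdot \frac{n^{-\frac{1}{2}}}{\Gamma(\frac{1}{2})} \cdot (\rho_r)^{-n}$.

We next employ eq. (103) of Theorem IX.16 of [3] for $k = x n^{\frac{1}{2}}$, where $x$ is contained
in any compact subinterval of $(0, +\infty)$. Then

(4.20)  $[z^n] h(z)^k \sim (\rho_r)^{-n} \cdot \frac{1}{n} \cdot \mathrm{Ray}(h_{\frac{1}{2}} x; \tfrac{1}{2})$,

where $\mathrm{Ray}(x; \tfrac{1}{2}) = \frac{x}{2} \exp(-\frac{x^2}{4})$ is the Rayleigh density function. Hence

$\mathbb{P}(X_n = k) \sim \frac{\Gamma(\frac{1}{2})}{m_{-\frac{1}{2}} \sqrt{n}}\, \mathrm{Ray}(h_{\frac{1}{2}} x; \tfrac{1}{2})
               = \frac{\Gamma(\frac{1}{2})\, h_{\frac{1}{2}}\, k}{2\, m_{-\frac{1}{2}}\, n} \cdot \exp\left( -\frac{h_{\frac{1}{2}}^2 k^2}{4n} \right)$

and the proof of the theorem is complete. □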



5. Discussion 



In this paper we demonstrated that the particular energy parametrization of RNA 
secondary structures affects the class of minimum free energy structures generated 
by DP-mfe folding algorithms in a subtle way. Minimal changes in parametrization 
can induce significant changes to the class of mfe-structures. We characterized 
the combinatorial impact of these changes in terms of the distribution of irreducible 
substructures, which has practical implications on the algorithmic level, i.e. for the effect
of sparsification of mfe DP-folding of such structures.

We find the following dichotomy: the distribution of irreducible substructures either 
is a discrete limit law or a central limit law. In the former case mfe-structures 
contain only finitely many irreducible substructures and the ratio of irreducible mfe- 
structures over all mfe-structures becomes in the limit of long sequences a positive 
constant. While this means "just" a constant time reduction for the sparsification of 
the DP-routine, the reduction is typically in the order of 90 % [8] and consequently 
still of practical interest. In the latter case, the fact that the central limit distribution
has a mean that scales linearly with sequence length alone implies that these
mfe-structures contain a large number of very small irreducible substructures. As these
irreducibles are small it becomes more and more unlikely to realize an mfe structure
as an irreducible. As a result sparsification has a somewhat "maximal" effect, i.e. a
linear reduction in time and space [19].

From the work of [I] we know that a natural RNA structure has a finite 5'-3' distance. 
This means that natural RNA structures contain only a finite number of irreducible 
substructures. In the context of our dichotomy result this means that sparsification 
of "realistically" parameterized mfe structures leads to a constant time reduction, 
in accordance with the findings in [5]. 

TABLE 1. Parameters of (a) (subcritical regime) and (b) (supercritical regime).

                 $\alpha_1$   $\alpha_2$   $\alpha_3$   $\beta_1$   $\beta_2$   $\gamma_1$   $\gamma_2$
subcritical      -5           -0.01        7.53         4           -1          -3.4         -0.6
supercritical    -5           -0.01        7.53         2           -1          -10          -3




At the transition point, where the distribution shifts from a discrete to a central limit 
law, a local limit law exists. Its density function is that of a Rayleigh distribution. 
It is easy to "test" our main theorem for the subcritical and supercritical regimes,
see Fig. 5, i.e. to sample the predicted limit laws. The particular parameters for the
two scenarios displayed in Fig. 5 are given in Tab. 1. By construction it is practically
impossible to localize the transition and sample the Rayleigh law.

In Fig. 6 we detail two typical RNA secondary structures sampled from the two
regimes. Instead of presenting these structures as diagrams we map them into
trees, where nodes represent irreducible substructures and edges are drawn if
irreducibles are nested. The tree representation shows clearly that a small variation
of energy parameters can have a dramatic effect on the mfe structures. It is also
evident why sparsification works much more efficiently in the supercritical regime.
Namely, in this case the irreducibles are less complex, which implies that it becomes
increasingly unlikely to find any candidates.
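
The mapping from a structure to its irreducibles and their nesting can be sketched in a few lines. The code below is an illustration under the assumption that structures are given as well-formed dot-bracket strings and that every arc is regarded as the rainbow of the irreducible substructure it encloses; the function names are hypothetical and the exact tree construction used for the figure may differ (for example, stacked arcs might be merged into one node).

```python
def irreducible_components(structure):
    """Return the intervals (i, j) of the outermost arcs of a dot-bracket string.

    Each interval is one irreducible substructure obtained by cutting the
    backbone without breaking any arc; their number is X_n in Theorem 4.
    """
    components, stack = [], []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            start = stack.pop()
            if not stack:              # arc is not nested in any other arc
                components.append((start, pos))
    return components


def nesting_tree(structure):
    """Map the start of every arc to the start of the arc directly enclosing it.

    Outermost arcs are mapped to None, so the result is a forest whose roots
    are the irreducible components of the whole structure.
    """
    parent, stack = {}, []
    for pos, ch in enumerate(structure):
        if ch == "(":
            stack.append(pos)
        elif ch == ")":
            start = stack.pop()
            parent[start] = stack[-1] if stack else None
    return parent
```

For the structure `((...)).(...)` the first function reports two irreducible components, and the second one records that the arc starting at position 1 is nested in the arc starting at position 0.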

FIGURE 5. The subcritical regime (a): sampling $10^5$ structures of length
$n = 700$ (black) versus the discrete limit law (red) as predicted by Theorem 4.
The supercritical regime (b): sampling $10^6$ structures of length
$n = 10^3$ (black stars) versus the central limit distribution as predicted by
Theorem 4 (blue line).



We are grateful to Fenix W.D. Huang for his help with the sampling curves of 

FIGURE 6. Tree representation of structures in the two regimes: a typical
subcritical structure (left) and a typical supercritical structure (right). One 
node in a tree represents an irreducible substructure and an edge visualizes 
the nesting relation. 